Introduction

The aim of this study is to predict tomorrow’s hourly electricity consumption using predictive AR and MA models, taking seasonality and trend components into account. The data is acquired from EPİAŞ and covers 1 January 2016 through 20 May 2021. The last 14 days are used for testing the models.

Data Reading & Preprocessing

Importing necessary libraries

library("readxl")
library("ggplot2")
library("tidyverse")
library("zoo")
library(plotly)
library("data.table")
library(forecast)
library(ggcorrplot)
library(stats)
library(urca)
library(lubridate)

The target variable is obtained as a csv file from EPİAŞ:

df <- fread("RealTimeConsumption-01012016-20052021.csv")

str(df)
## Classes 'data.table' and 'data.frame':   47208 obs. of  3 variables:
##  $ Date             : chr  "01.01.2016" "01.01.2016" "01.01.2016" "01.01.2016" ...
##  $ Hour             : chr  "00:00" "01:00" "02:00" "03:00" ...
##  $ Consumption (MWh): chr  "26,277.24" "24,991.82" "23,532.61" "22,464.78" ...
##  - attr(*, ".internal.selfref")=<externalptr>
#dropping rows with missing values (the result must be assigned back)
df <- drop_na(df)
#converting back to data table
setDT(df)

#combining Date and Hour column
df$NewDate <-dmy(df$Date)+hm(df$Hour)
#character to numeric values for consumption series
df$`Consumption (MWh)` <-gsub(',','',df$`Consumption (MWh)`)
df$`Consumption (MWh)`<-as.numeric(df$`Consumption (MWh)`)
df$year<-format(df$NewDate, "%y")
df$day<-format(df$NewDate, "%m/%d")
df$month<-format(df$NewDate, "%m")

There are no missing values.
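This claim can be verified programmatically. A minimal sanity check (assuming the `NewDate` column built above) confirms that there are no NAs and that the hourly timestamps form a complete, gapless sequence:

```r
# Sanity check: no missing consumption values and no gaps or
# duplicates in the hourly timestamp sequence.
stopifnot(!anyNA(df$`Consumption (MWh)`))
stopifnot(all(as.numeric(diff(df$NewDate), units = "hours") == 1))
```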

1. Possible Types of Seasonality Exhibited by Hourly Electricity Consumption

The visualization of the time series data:

p <- ggplot(data = df, aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
  geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in Turkey between 2016-2021', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)

fig <- fig %>% layout(dragmode = "pan")

fig

From the preliminary analysis, it can be seen that the data has monthly seasonality. For instance, electricity consumption is highest in the summer season, i.e. between June and September. On the other hand, the yearly trend is stable and approximately linear between 2016 and 2020. Zooming into the monthly level in 2016:

p <- ggplot(data = df[1:8500,], aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
  geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in 2016', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)

fig <- fig %>% layout(dragmode = "pan")

fig

Now the monthly pattern is more observable. Consumption is higher in the summer months, with significant reductions visible on 27 March, 7 June and 13 September. I couldn’t find any explanation for 27 March; however, 7 June and 13 September were national holidays in Turkey. From that information, national holidays have a huge impact on energy consumption, probably because of the reduction in industrial electricity consumption. Zooming further into the weekly and daily level:

p <- ggplot(data = df[1:745,], aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
  geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in January 2016', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)

fig <- fig %>% layout(dragmode = "pan")

fig
p <- ggplot(data = df[1:200,], aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
  geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in 1-8 January 2016', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)

fig <- fig %>% layout(dragmode = "pan")

fig

These two figures also show weekly seasonality, i.e. the same pattern is observable every 7 days. Furthermore, consumption peaks between 10:00 and 18:00 every day, so intraday (hourly) seasonality exists too.

To sum up, hourly, daily, weekly and monthly seasonality exist in the consumption data and should be taken into account in the predictive models. The data also has a stable, linear yearly trend. To further investigate and eliminate the seasonality and trend components, decomposition of the time series at different levels (hourly, daily, weekly, monthly) is required, because ARIMA models work best when the seasonality components are known and the data is made stationary. Since the variance is not increasing over time, additive decomposition will be used. To check whether the data is stationary, the KPSS unit root test is used:

KPSS_test= ur.kpss(df$`Consumption (MWh)`)
summary(KPSS_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 18 lags. 
## 
## Value of test-statistic is: 12.695 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
plot(KPSS_test)

The null hypothesis is that the data is stationary, and the value of the test statistic is far bigger than the critical values. Therefore, the null hypothesis is rejected and the data is non-stationary.
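As a sketch of how the differencing parameter d could be chosen, one can difference the series once and re-run the KPSS test; if the statistic falls below the critical values, a single difference suffices. This is only an illustration, since this study instead removes trend and seasonality by decomposition:

```r
# Illustration: KPSS test on the first-differenced series. A test
# statistic below the critical values would indicate d = 1 suffices.
diff_series <- diff(df$`Consumption (MWh)`)
summary(ur.kpss(diff_series))
```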

Hourly Decomposition

df_new <- ts(df$`Consumption (MWh)`,frequency=24)

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

From the figure, the seasonality is not clearly observable, but from the weekly figure at the beginning of the study, we know that consumption peaks between 10:00 and 18:00 every day, meaning there is an increasing trend in the mornings and a decreasing trend at night. Therefore, hourly seasonality exists.
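To make the intraday pattern explicit, consumption can also be averaged by hour of day; the `hour` helper column below is added only for this illustration:

```r
# Average consumption per hour of day; the intraday seasonality
# (morning ramp-up, evening decline) becomes directly visible.
df$hour <- hour(df$NewDate)
hourly_profile <- df %>% group_by(hour) %>%
  summarize(AvgConsumption = mean(`Consumption (MWh)`))
ggplot(hourly_profile, aes(x = hour, y = AvgConsumption)) + geom_line()
```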

Daily Decomposition

To analyze daily trends and seasonality, the hourly data is aggregated into daily data by averaging each 24-hour period.

daily_df <- df %>% group_by(year,day) %>% summarize(DailyAvg = mean(`Consumption (MWh)`))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(daily_df$DailyAvg,frequency=7)

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

The trend and seasonality are clearly observable in the figure above. The random component resembles white noise, but large deviations can be observed on national holidays.

Monthly Decomposition

To analyze monthly trends and seasonality, the daily data is aggregated into monthly data by averaging the daily values.

monthly_df <- df %>% group_by(year,month,day) %>% summarize(DailyAvg = mean(`Consumption (MWh)`))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
monthly_df <- monthly_df %>% group_by(year,month) %>% summarize(Monthly = mean(DailyAvg))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(monthly_df$Monthly,frequency=12)

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

For the monthly data, the seasonality and the trend line are very clearly observable.

2. Decomposing the Time-Series Based on the 168 Hours’ Period

Supposing that there is a pattern every 168 hours (7 days), the time series is decomposed as follows:

df_new <- ts(df$`Consumption (MWh)`,frequency=168)

df_additive <- decompose(df_new,type="additive")
plot(df_additive)

Over the 7-day period, energy consumption is lowest on Sundays and second lowest on Saturdays (as can be seen in the daily interactive figures above). Not only a trend but also seasonality exists in the data, and both should be eliminated for the predictive models.
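The weekend effect described above can be quantified directly; the `weekday` helper column below is added only for this check:

```r
# Average consumption per day of week; Sunday should be lowest,
# Saturday second lowest.
df$weekday <- wday(df$NewDate, label = TRUE, week_start = 1)
df %>% group_by(weekday) %>%
  summarize(AvgConsumption = mean(`Consumption (MWh)`))
```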

3. Deseasonalizing and Detrending the Time Series and Applying AR Models

For deseasonalizing and detrending the time series, the trend and seasonal components are subtracted from the consumption data:

df_deseasoned_detrended <- df_new-df_additive$seasonal-df_additive$trend

#wrap the ts and the timestamps in a data frame so ggplot can pick scales
plot_df <- data.frame(NewDate = df$NewDate,
                      Deseasonalized = as.numeric(df_deseasoned_detrended))
p <- ggplot(data = plot_df, aes(x = NewDate, y = Deseasonalized, group = 1))+
  geom_line(color = '#007cc3') + labs(title = 'Deseasonalized and Detrended Data', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
fig <- fig %>% layout(dragmode = "pan")

fig

The data now looks stationary and has no increasing or decreasing trend. The deviations from the mean result mostly from national holidays. To determine appropriate AR, MA and ARMA models, the KPSS unit root test can be applied to check the stationarity of the data for the differencing (d) parameter. Also, the ACF and PACF plots should be investigated for the p and q parameters:

KPSS_test= ur.kpss(df_deseasoned_detrended)
summary(KPSS_test)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 18 lags. 
## 
## Value of test-statistic is: 0.0042 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The null hypothesis cannot be rejected, since the value of the test statistic is lower than the critical values; hence the data is now stationary.

acf(df_deseasoned_detrended,na.action=na.pass)

pacf(df_deseasoned_detrended,na.action=na.pass)

From the PACF figure, the p parameter should be close to 1, and can be further extended to 6-7 and 24 (as can be initially guessed). From the ACF figure, the q parameter should be either 1 or 2. For training the models, the test set (the last 14 days) should be excluded from the dataset. Starting with AR models, i.e. models with only the autoregressive term (p):

Train-test set split:

df_train <- df_deseasoned_detrended[1:(length(df_deseasoned_detrended)-14*24)]
df_test <- df_deseasoned_detrended[(length(df_deseasoned_detrended)-14*24+1):length(df_deseasoned_detrended)]

AR Models:

model <- arima(df_train, order=c(1,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.9165     0.2836
## s.e.  0.0018    32.5934
## 
## sigma^2 estimated as 345752:  log likelihood = -364745.1,  aic = 729496.2
AIC(model)
## [1] 729496.2
model <- arima(df_train, order=c(2,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       1.3305  -0.4518     0.2830
## s.e.  0.0041   0.0041    20.0044
## 
## sigma^2 estimated as 275194:  log likelihood = -359405.7,  aic = 718819.4
AIC(model)
## [1] 718819.4
model <- arima(df_train, order=c(3,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(3, 0, 0))
## 
## Coefficients:
##          ar1      ar2      ar3  intercept
##       1.3229  -0.4294  -0.0168     0.2831
## s.e.  0.0046   0.0074   0.0046    19.6674
## 
## sigma^2 estimated as 275116:  log likelihood = -359399.1,  aic = 718808.2
AIC(model)
## [1] 718808.2
model <- arima(df_train, order=c(4,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(4, 0, 0))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4  intercept
##       1.3223  -0.4431  0.0255  -0.0320     0.2833
## s.e.  0.0046   0.0077  0.0077   0.0046    19.0461
## 
## sigma^2 estimated as 274833:  log likelihood = -359375,  aic = 718762.1
AIC(model)
## [1] 718762.1
model <- arima(df_train, order=c(5,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(5, 0, 0))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ar5  intercept
##       1.3225  -0.4432  0.0272  -0.0368  0.0036     0.2839
## s.e.  0.0046   0.0077  0.0079   0.0077  0.0046    19.1183
## 
## sigma^2 estimated as 274830:  log likelihood = -359374.7,  aic = 718763.5
AIC(model)
## [1] 718763.5
model <- arima(df_train, order=c(6,0,0))
print(model)
## 
## Call:
## arima(x = df_train, order = c(6, 0, 0))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ar5      ar6  intercept
##       1.3226  -0.4445  0.0281  -0.0518  0.0482  -0.0338     0.2839
## s.e.  0.0046   0.0077  0.0079   0.0079  0.0077   0.0046    18.4739
## 
## sigma^2 estimated as 274517:  log likelihood = -359348.1,  aic = 718712.2
AIC(model)
## [1] 718712.2
#model <- arima(df_train, order=c(24,0,0))
#print(model)
#AIC(model)

The lowest AIC would likely be achieved with p = 24. However, due to its high computational cost, the p = 6 model will be used instead.
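The repeated `arima` calls above can be condensed into a loop; this sketch fits the same models and collects the AIC values in one table:

```r
# Fit AR(p) for p = 1..6 and tabulate the AIC values.
aic_ar <- sapply(1:6, function(p) AIC(arima(df_train, order = c(p, 0, 0))))
data.frame(p = 1:6, AIC = aic_ar)
```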

4. Applying MA Models

As mentioned before, the ACF figure suggests that the q parameter should be either 1 or 2 for the MA models. However, larger q values are worth checking too.

model <- arima(df_train, order=c(0,0,1))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 1))
## 
## Coefficients:
##          ma1  intercept
##       0.8527     0.2825
## s.e.  0.0018     7.5209
## 
## sigma^2 estimated as 770951:  log likelihood = -383504.5,  aic = 767015
AIC(model)
## [1] 767015
model <- arima(df_train, order=c(0,0,2))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 2))
## 
## Coefficients:
##          ma1     ma2  intercept
##       1.2002  0.6130     0.2827
## s.e.  0.0037  0.0029     8.8227
## 
## sigma^2 estimated as 460259:  log likelihood = -371437.3,  aic = 742882.6
AIC(model)
## [1] 742882.6
model <- arima(df_train, order=c(0,0,3))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 3))
## 
## Coefficients:
##          ma1     ma2     ma3  intercept
##       1.2839  1.0058  0.4755     0.2827
## s.e.  0.0043  0.0047  0.0035    10.3149
## 
## sigma^2 estimated as 351146:  log likelihood = -365107.3,  aic = 730224.5
AIC(model)
## [1] 730224.5
model <- arima(df_train, order=c(0,0,4))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 4))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4  intercept
##       1.3153  1.1615  0.7726  0.3117     0.2832
## s.e.  0.0049  0.0066  0.0050  0.0041    11.7820
## 
## sigma^2 estimated as 312144:  log likelihood = -362353,  aic = 724718
AIC(model)
## [1] 724718
model <- arima(df_train, order=c(0,0,5))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 5))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4     ma5  intercept
##       1.3274  1.2583  1.0048  0.6145  0.2441     0.2840
## s.e.  0.0045  0.0069  0.0071  0.0061  0.0041    13.5799
## 
## sigma^2 estimated as 290538:  log likelihood = -360675,  aic = 721363.9
AIC(model)
## [1] 721363.9
model <- arima(df_train, order=c(0,0,6))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 6))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4     ma5     ma6  intercept
##       1.3326  1.3004  1.0883  0.7385  0.4029  0.1414     0.2843
## s.e.  0.0046  0.0074  0.0086  0.0081  0.0064  0.0043    14.8003
## 
## sigma^2 estimated as 284398:  log likelihood = -360175.3,  aic = 720366.6
AIC(model)
## [1] 720366.6
model <- arima(df_train, order=c(0,0,7))
print(model)
## 
## Call:
## arima(x = df_train, order = c(0, 0, 7))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4     ma5     ma6     ma7  intercept
##       1.3124  1.2467  1.0475  0.7859  0.5725  0.3799  0.1772     0.2857
## s.e.  0.0049  0.0084  0.0091  0.0078  0.0100  0.0123  0.0075    15.9510
## 
## sigma^2 estimated as 279806:  log likelihood = -359794.5,  aic = 719606.9
AIC(model)
## [1] 719606.9

The AIC decreases only slowly after q = 5; therefore q = 5 is sufficient for the MA model, also considering the complexity and training speed of the model.
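As a cross-check of the hand-picked orders, `forecast::auto.arima` can search p and q jointly on the stationary series. This is only a sketch; a full search over this many observations is slow:

```r
# Let auto.arima search non-seasonal ARMA orders with d fixed to 0
# and compare its AIC against the manually selected models.
auto_fit <- auto.arima(df_train, d = 0, seasonal = FALSE,
                       max.p = 6, max.q = 6, stepwise = TRUE)
summary(auto_fit)
```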

5. Comparison of AR and MA Models and Training and Prediction of ARMA Models

Comparing the AIC values of the best AR and MA models, the AR model with p = 6 has a lower AIC than the MA model with q = 5. To obtain an even better model, let’s start with an ARMA model with p = 6 and q = 5; with p = 24, my computer cannot train the model due to its computational cost:

model <- arima(df_train, order=c(6,0,5))
## Warning in log(s2): NaNs produced
print(model)
## 
## Call:
## arima(x = df_train, order = c(6, 0, 5))
## 
## Coefficients:
##          ar1     ar2      ar3     ar4      ar5      ar6     ma1     ma2     ma3
##       0.6564  0.3405  -0.3684  0.2533  -0.0860  -0.0438  0.6670  0.0990  0.2298
## s.e.  0.1810  0.2569   0.0675  0.1071   0.1311   0.0480  0.1814  0.0613  0.0479
##           ma4      ma5  intercept
##       -0.0314  -0.0372     0.3594
## s.e.   0.0874   0.0143    18.8209
## 
## sigma^2 estimated as 274401:  log likelihood = -359338.2,  aic = 718702.4
AIC(model)
## [1] 718702.4
BIC(model)
## [1] 718816.2

The AIC value is lower than those of both the best AR and the best MA models. Checking the residuals:

checkresiduals(model)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(6,0,5) with non-zero mean
## Q* = 1640.8, df = 3, p-value < 2.2e-16
## 
## Model df: 12.   Total lags used: 15

The autocorrelation is not completely eliminated; however, the residuals are approximately normally distributed. To compare the actual and predicted values of the test set, the forecast must be recombined with the trend and seasonal components and visualized. For the seasonal component, I took the first 14*24 = 336 seasonal values, since the pattern repeats every 336 data points. For the trend component, I took the last 14*24 non-missing trend values, assuming the trend continues at its most recent level.

model_forecast <- predict(model, n.ahead = 14*24)$pred
last_trend_value <-tail(df_additive$trend[!is.na(df_additive$trend)],14*24)

seasonality <- df_additive$seasonal[1:(14*24)]
forecast_combined <-model_forecast+last_trend_value+seasonality
df$forecast <- 0
df$forecast[(length(df_deseasoned_detrended)-14*24+1):length(df_deseasoned_detrended)] <-forecast_combined


p <- ggplot()+
  geom_line(data = df[46500:47208],aes(x = NewDate,y=`Consumption (MWh)`,color='Actual Data') ) +
  geom_line(data = df[(47208-24*14+1):47208],aes(x = NewDate,y=forecast,color='Prediction')) +
  scale_color_manual(values = c(
    'Actual Data' = '#007cc3',
    'Prediction' = 'red')) +
  labs(title = 'Hourly Electricity Consumption(MWh) Prediction', x= "Date", y = "Hourly Electricity Consumption(MWh)",color = 'Legend') +
  theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)

fig <- fig %>% layout(dragmode = "pan")

fig

The prediction represents the intraday changes accurately, but fails to capture the increasing or decreasing trend over the days. It is more successful for the first 5 days; after that, the day-to-day trend is predicted incorrectly. To evaluate the model and measure its overall accuracy, the daily mean absolute percentage error (MAPE) for each date and the weighted mean absolute percentage error (WMAPE) are calculated:

metrics <- df[(47208-24*14+1):47208] %>% group_by(year,day) %>% summarize(MAPE = mean(abs((`Consumption (MWh)`-forecast)/`Consumption (MWh)`)) * 100)
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
metrics

As can be seen above, the MAPE is lower especially on the first 3 days.

#WMAPE = sum of absolute errors divided by sum of actuals
test_range <- df[(47208-24*14+1):47208]
wmape <- sum(abs(test_range$`Consumption (MWh)`-test_range$forecast))/sum(test_range$`Consumption (MWh)`)*100
wmape

The WMAPE summarizes the overall error in a single figure; consistent with the daily MAPE values, it is driven mostly by the wrong predictions, especially after the first 3 days.
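The same test window can also be evaluated with `forecast::accuracy`, which reports several error measures (RMSE, MAE, MAPE, among others) at once:

```r
# Compare forecasts against actuals on the 14-day test window.
test_window <- df[(47208 - 24*14 + 1):47208]
accuracy(test_window$forecast, test_window$`Consumption (MWh)`)
```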

6. Conclusion

In conclusion, the hourly electricity consumption data has a highly predictable nature, and with the right model, hourly, daily or monthly prediction is achievable even with simple ARMA methods. As future work, SARIMA models can be introduced, and the remaining autocorrelation can be addressed with AR models with a high p parameter.
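As a concrete starting point for that future work, a dynamic harmonic regression (Fourier terms for the daily and weekly cycles plus ARMA errors) is a common, computationally lighter alternative to a SARIMA with a 168-hour seasonal period. This is only a sketch; the K values below are illustrative choices:

```r
# Multi-seasonal ts with daily (24) and weekly (168) periods, then
# ARMA errors with Fourier seasonal regressors (K harmonics each).
df_msts <- msts(df$`Consumption (MWh)`, seasonal.periods = c(24, 168))
fit <- auto.arima(df_msts, seasonal = FALSE,
                  xreg = fourier(df_msts, K = c(10, 20)))
fc <- forecast(fit, xreg = fourier(df_msts, K = c(10, 20), h = 14 * 24))
```

The Fourier approach keeps the non-seasonal ARMA machinery used in this study while still modeling both seasonal cycles explicitly.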